10 research outputs found
Statistical dependency parsing of Turkish
This paper presents results from the first statistical dependency parser for Turkish. Turkish is a free-constituent order language with complex agglutinative inflectional and derivational morphology and presents interesting challenges for statistical parsing, as in general, dependency relations are between “portions” of words called inflectional groups. We have explored statistical models that use different representational units for parsing. We have used the Turkish Dependency Treebank to train and test our parser but have limited this initial exploration to that subset of the treebank sentences with only left-to-right non-crossing dependency links. Our results indicate that the best accuracy in terms of the dependency relations between inflectional groups is obtained when we use inflectional groups as units in parsing, and when contexts around the dependent are employed
The incremental use of morphological information and lexicalization in data-driven dependency parsing
Typological diversity among the natural languages of the world poses interesting challenges for the models and algorithms used in syntactic parsing. In this paper, we apply a data-driven dependency parser to Turkish, a language characterized by rich morphology and flexible constituent order, and study the effect of employing varying amounts of morpholexical information on parsing performance. The investigations show that accuracy can be improved by using representations based on inflectional groups rather than word forms, confirming earlier studies. In addition, lexicalization and the use of rich morphological features are found to have a positive effect. By combining all these techniques, we obtain the highest reported accuracy for parsing the Turkish Treebank
Dependency parsing of Turkish
The suitability of different parsing methods for different languages is an important topic in
syntactic parsing. Especially lesser-studied languages, typologically different from the languages
for which methods have originally been developed, poses interesting challenges in this respect.
This article presents an investigation of data-driven dependency parsing of Turkish, an agglutinative
free constituent order language that can be seen as the representative of a wider class
of languages of similar type. Our investigations show that morphological structure plays an
essential role in finding syntactic relations in such a language. In particular, we show that
employing sublexical representations called inflectional groups, rather than word forms, as the
basic parsing units improves parsing accuracy. We compare two different parsing methods, one
based on a probabilistic model with beam search, the other based on discriminative classifiers and
a deterministic parsing strategy, and show that the usefulness of sublexical units holds regardless
of parsing method.We examine the impact of morphological and lexical information in detail and
show that, properly used, this kind of information can improve parsing accuracy substantially.
Applying the techniques presented in this article, we achieve the highest reported accuracy for
parsing the Turkish Treebank
Statistical Dependency Parsing of Turkish
This paper presents results from the first
statistical dependency parser for Turkish.
Turkish is a free-constituent order language
with complex agglutinative inflectional
and derivational morphology and
presents interesting challenges for statistical
parsing, as in general, dependency relations
are between “portions” of words
– called inflectional groups. We have
explored statistical models that use different
representational units for parsing.
We have used the Turkish Dependency
Treebank to train and test our parser
but have limited this initial exploration
to that subset of the treebank sentences
with only left-to-right non-crossing dependency
links. Our results indicate that the
best accuracy in terms of the dependency
relations between inflectional groups is
obtained when we use inflectional groups
as units in parsing, and when contexts
around the dependent are employed
Crowdsourcing Idiomatic Expressions
Learning idiomatic expressions is seen as one of the most challenging stages in second language learning because of their unpredictable meaning. A similar situation holds for their identification within natural language processing applications such as machine translation and parsing. The lack of high-quality usage samples exacerbates this challenge not only for humans but also for artificial intelligence systems. This article introduces a gamified crowdsourcing approach for collecting language learning materials for idiomatic expressions; a messaging bot is designed as an asynchronous multiplayer game for native speakers who compete with each other while providing idiomatic and nonidiomatic usage examples and rating other players entries. As opposed to classical crowd-processing annotation efforts in the field, for the first time in the literature, a crowd-creating & crowd-rating approach is implemented and tested for idiom corpora construction. The approach is language independent and evaluated on two languages in comparison to traditional data preparation techniques in the field. The reaction of the crowd is monitored under different motivational means (namely, gamification affordances and monetary rewards). The results reveal that the proposed approach is powerful in collecting the targeted materials, and although being an explicit crowd-sourcing approach, it is found entertaining and useful by the crowd. The approach has been shown to have the potential to speed up the construction of idiom corpora for different natural languages to be used second language learning material, training data for supervised idiom identification systems, or samples for lexicographic studie
Dependency Parsing of Turkish
The suitability of different parsing methods for different languages is an important topic in syntactic parsing. Especially lesser-studied languages, typologically different from the languages for which methods have originally been developed, pose interesting challenges in this respect. This article presents an investigation of data-driven dependency parsing of Turkish, an agglutinative, free constituent order language that can be seen as the representative of a wider class of languages of similar type. Our investigations show that morphological structure plays an essential role in finding syntactic relations in such a language. In particular, we show that employing sublexical units called inflectional groups, rather than word forms, as the basic parsing units improves parsing accuracy. We test our claim on two different parsing methods, one based on a probabilistic model with beam search and the other based on discriminative classifiers and a deterministic parsing strategy, and show that the usefulness of sublexical units holds regardless of the parsing method. We examine the impact of morphological and lexical information in detail and show that, properly used, this kind of information can improve parsing accuracy substantially. Applying the techniques presented in this article, we achieve the highest reported accuracy for parsing the Turkish Treebank.<br